{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<center>\n",
                "    <p style=\"text-align:center\">\n",
                "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
                "        <br>\n",
                "        <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
                "        |\n",
                "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
                "        |\n",
                "        <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
                "    </p>\n",
                "</center>\n",
                "\n",
                "# <center>Evaluation using Pydantic Evals</center>\n",
                "\n",
                "Pydantic offers an evaluation library that can be used to run preset direct evaluations, such as whether an output matches a Pydantic model, as well as LLM Judge evaluations. These evals can be run directly over dataframes of cases defined with Pydantic. However, you may want to run evaluations over real traces as opposed to presaved cases.\n",
                "\n",
                "This notebook shows you how you can use Pydantic Evals alongside Arize Phoenix to run evals on traces captured from your running application.\n",
                "\n",
                "<img width=500px src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/images/pydantic-eval-diagram.png\" />\n",
                "\n",
                "*Note: Phoenix does include its own evals package, however it is designed to work with other eval packages like Pydantic Evals as well.*"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "857d8bf104ed"
            },
            "source": [
                "## Install dependencies"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "!pip install -q pydantic-evals arize-phoenix openai openinference-instrumentation-openai \"httpx<0.28.0,>=0.23.0\""
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Setup API keys and imports"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [],
            "source": [
                "import os\n",
                "from getpass import getpass\n",
                "\n",
                "from openai import OpenAI\n",
                "from pydantic_evals import Case, Dataset\n",
                "\n",
                "import phoenix as px\n",
                "\n",
                "if os.getenv(\"OPENAI_API_KEY\") is None:\n",
                "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Enable Phoenix Tracing"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Sign up for a free instance of [Phoenix Cloud](https://app.phoenix.arize.com) to get your API key. If you'd prefer, you can instead [self-host Phoenix](https://arize.com/docs/phoenix/deployment)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [],
            "source": [
                "if os.getenv(\"PHOENIX_API_KEY\") is None:\n",
                "    os.environ[\"PHOENIX_API_KEY\"] = getpass(\"Enter your Phoenix API key: \")\n",
                "\n",
                "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
                "os.environ[\"PHOENIX_CLIENT_HEADERS\"] = f\"api_key={os.getenv('PHOENIX_API_KEY')}\""
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from phoenix.otel import register\n",
                "\n",
                "tracer_provider = register(\n",
                "    project_name=\"pydantic-evals-tutorial\",\n",
                "    auto_instrument=True,  # because you've imported the openinference-instrumentation-openai package above, this will automatically instrument any OpenAI method calls\n",
                ")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Create Example Traces to Evaluate\n",
                "\n",
                "Next, we'll run some example inputs through an LLM call to generate traces that we can evaluate. In practice, you'd likely already have an application you're tracing that you'd want to evaluate instead."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "client = OpenAI()\n",
                "\n",
                "inputs = [\n",
                "    \"What is the capital of France?\",\n",
                "    \"Who wrote Romeo and Juliet?\",\n",
                "    \"What is the largest planet in our solar system?\",\n",
                "]\n",
                "\n",
                "\n",
                "def generate_trace(input):\n",
                "    client.chat.completions.create(\n",
                "        model=\"gpt-4o-mini\",\n",
                "        messages=[\n",
                "            {\n",
                "                \"role\": \"system\",\n",
                "                \"content\": \"You are a helpful assistant. Only respond with the answer to the question as a single word or proper noun.\",\n",
                "            },\n",
                "            {\"role\": \"user\", \"content\": input},\n",
                "        ],\n",
                "    )\n",
                "\n",
                "\n",
                "for input in inputs:\n",
                "    generate_trace(input)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "You should now see three traces captured in your Phoenix instance. If you don't see them right away, make sure you've selected the `pydantic-evals-tutorial` project."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Export Traces from Phoenix\n",
                "\n",
                "Next, you export those traces from Phoenix so that you can evaluate them using Pydantic Evals."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from phoenix.trace.dsl import SpanQuery\n",
                "\n",
                "query = SpanQuery().select(\n",
                "    input=\"llm.input_messages\",\n",
                "    output=\"llm.output_messages\",\n",
                ")\n",
                "\n",
                "# The Phoenix Client can take this query and return the dataframe.\n",
                "spans = px.Client().query_spans(query, project_name=\"pydantic-evals-tutorial\")\n",
                "spans[\"input\"] = spans[\"input\"].apply(lambda x: x[1].get(\"message\").get(\"content\"))\n",
                "spans[\"output\"] = spans[\"output\"].apply(lambda x: x[0].get(\"message\").get(\"content\"))\n",
                "spans.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Define the Evaluation Dataset\n",
                "Create a dataset of test cases using Pydantic Evals for a question-answering task.\n",
                "1. Each Case represents a single test with an input (question) and an expected output (answer).\n",
                "2. The Dataset aggregates these cases for evaluation."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [],
            "source": [
                "cases = [\n",
                "    Case(\n",
                "        name=\"capital of France\", inputs=\"What is the capital of France?\", expected_output=\"Paris\"\n",
                "    ),\n",
                "    Case(\n",
                "        name=\"author of Romeo and Juliet\",\n",
                "        inputs=\"Who wrote Romeo and Juliet?\",\n",
                "        expected_output=\"William Shakespeare\",\n",
                "    ),\n",
                "    Case(\n",
                "        name=\"largest planet\",\n",
                "        inputs=\"What is the largest planet in our solar system?\",\n",
                "        expected_output=\"Jupiter\",\n",
                "    ),\n",
                "]"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Setup LLM task, Evaluator, and Dataset for Pydantic\n",
                "\n",
                "Pydantic Evals requires a task to run each case through. Since you've already run this task for a given input (represented by the traces you captured above), this case will simply be retrieving the corresponding output from your dataframe of exported traces."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 11,
            "metadata": {},
            "outputs": [],
            "source": [
                "import nest_asyncio\n",
                "\n",
                "nest_asyncio.apply()\n",
                "\n",
                "\n",
                "async def task(input: str) -> str:\n",
                "    output = spans[spans[\"input\"] == input][\"output\"].values[0]\n",
                "    return output"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Then create a basic evaluator that checks whether the output matches the expected value exactly."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [],
            "source": [
                "from pydantic_evals.evaluators import Evaluator, EvaluatorContext\n",
                "\n",
                "client = OpenAI()\n",
                "\n",
                "\n",
                "class MatchesExpectedOutput(Evaluator[str, str]):\n",
                "    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:\n",
                "        is_correct = ctx.expected_output == ctx.output\n",
                "        return is_correct"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 10,
            "metadata": {},
            "outputs": [],
            "source": [
                "dataset = Dataset(\n",
                "    cases=cases,\n",
                "    evaluators=[MatchesExpectedOutput()],\n",
                ")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Run your experiment and evaluation\n",
                "\n",
                "Now with everything connected up, you can run your evaluation using Pydantic:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "report = dataset.evaluate_sync(task)\n",
                "print(report)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Redefine Eval to be LLM-powered or Semantic\n",
                "\n",
                "That evaluation works fine, however the exact match is a bit too strict to work in a real world setting. Try adding two other kinds of evaluators, a fuzzy match eval and an LLM judge eval."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 13,
            "metadata": {},
            "outputs": [],
            "source": [
                "class FuzzyMatchesOutput(Evaluator[str, str]):\n",
                "    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:\n",
                "        # Using fuzzy matching to compare expected and actual outputs\n",
                "        from difflib import SequenceMatcher\n",
                "\n",
                "        def similarity_ratio(a, b):\n",
                "            return SequenceMatcher(None, a, b).ratio()\n",
                "\n",
                "        # Consider it correct if similarity is above 0.8 (80%)\n",
                "        is_correct = similarity_ratio(ctx.expected_output, ctx.output) > 0.8\n",
                "        return is_correct\n",
                "\n",
                "\n",
                "dataset.add_evaluator(FuzzyMatchesOutput())"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 14,
            "metadata": {},
            "outputs": [],
            "source": [
                "from pydantic_evals.evaluators import LLMJudge\n",
                "\n",
                "dataset.add_evaluator(\n",
                "    LLMJudge(\n",
                "        rubric=\"Output and Expected Output should represent the same answer, even if the text doesn't match exactly\",\n",
                "        include_input=True,\n",
                "        model=\"openai:gpt-4o-mini\",\n",
                "    ),\n",
                ")"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "report = dataset.evaluate_sync(task)\n",
                "print(report)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "You should now see that the LLM Judge at least catches that \"Shakespeare\" and \"William Shakespeare\" represent the same answer."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Upload Labels to Phoenix\n",
                "\n",
                "As a final step, you can now upload your eval results to Phoenix to capture them in the UI."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "results = report.model_dump()"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 17,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Create a dataframe for each eval\n",
                "meo_spans = spans.copy()\n",
                "fuzzy_label_spans = spans.copy()\n",
                "llm_label_spans = spans.copy()\n",
                "\n",
                "for case in results.get(\"cases\"):\n",
                "    # Phoenix expects a \"label\" column, so start by extracting the eval result from each row\n",
                "    meo_label = case.get(\"assertions\").get(\"MatchesExpectedOutput\").get(\"value\")\n",
                "    fuzzy_label = case.get(\"assertions\").get(\"FuzzyMatchesOutput\").get(\"value\")\n",
                "    llm_label = case.get(\"assertions\").get(\"LLMJudge\").get(\"value\")\n",
                "\n",
                "    input = case.get(\"inputs\")\n",
                "\n",
                "    # Update the label in each dataframe where the input value matches\n",
                "    meo_spans.loc[meo_spans[\"input\"] == input, \"label\"] = str(meo_label)\n",
                "    fuzzy_label_spans.loc[meo_spans[\"input\"] == input, \"label\"] = str(fuzzy_label)\n",
                "    llm_label_spans.loc[llm_label_spans[\"input\"] == input, \"label\"] = str(llm_label)\n",
                "\n",
                "# Phoenix can also take in a numeric score for each row which it uses to calculate overall metrics\n",
                "meo_spans[\"score\"] = meo_spans[\"label\"].apply(lambda x: 1 if x else 0)\n",
                "fuzzy_label_spans[\"score\"] = fuzzy_label_spans[\"label\"].apply(lambda x: 1 if x else 0)\n",
                "llm_label_spans[\"score\"] = llm_label_spans[\"label\"].apply(lambda x: 1 if x else 0)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "meo_spans.head()"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from phoenix.client import AsyncClient\n",
                "\n",
                "px_client = AsyncClient()\n",
                "await px_client.spans.log_span_annotations_dataframe(\n",
                "    dataframe=meo_spans,\n",
                "    annotation_name=\"Direct Match Eval\",\n",
                "    annotator_kind=\"LLM\",\n",
                ")\n",
                "await px_client.spans.log_span_annotations_dataframe(\n",
                "    dataframe=fuzzy_label_spans,\n",
                "    annotation_name=\"Fuzzy Match Eval\",\n",
                "    annotator_kind=\"LLM\",\n",
                ")\n",
                "await px_client.spans.log_span_annotations_dataframe(\n",
                "    dataframe=llm_label_spans,\n",
                "    annotation_name=\"LLM Match Eval\",\n",
                "    annotator_kind=\"LLM\",\n",
                ")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "![results_in_phoenix](https://storage.googleapis.com/arize-phoenix-assets/assets/images/pydantic-evals-results.png)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "#### For more on LLM Evaluation, check out our [Arize Master Guide to LLM Evaluation](https://arize.com/llm-evaluation)!"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "phoenix",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.11.10"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 0
}
